The dataset comes from Prosper, a peer-to-peer lending company. It has 113,937 rows and 81 variables.

When I first looked into the data, I was keen to see the distributions of various variables to help me get a sense of the data.


Univariate plots

In the univariate plots section, I am trying to get a sense of how the data is distributed and what each parameter in the dataset represents.

Plot 1.

Distribution of prosper alpha scores.

This is a plot of the Prosper rating alpha score, a proprietary rating assigned to each borrow request from an individual. It is assigned based on various parameters (more on this to follow). I wanted to see the distribution of this rating over the population. As expected, the ratings follow a roughly normal pattern, which is reasonable for a population with varied economic backgrounds. Since the mean (4.072) is slightly greater than the median (4.00), the distribution has a slightly longer tail towards higher Prosper alpha scores, but it is almost normal.

Plot 2.

Ratio of HR rated loans.

Next, I analysed the distribution of credit scores in the loan data. I created a new variable, credit.score.exp, which is the average of the upper and lower bound credit scores. Since the credit score distribution is much like the distribution of alpha scores, I decided to look at the distribution of non-performing loans instead. Here I plot the distribution of credit score, but only for HR rated loans.

This shows that the ratio of ‘HR’ loans to total number of loans decreases steadily with increase in credit score. As a matter of fact, higher credit scores (> 775) have zero high risk loans.
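The ratio described above can be sketched in a few lines of base R. This is only an illustration on made-up rows: the column names `credit.score.lower`, `credit.score.upper` and `prosper.rating` are assumptions, not necessarily the names used in the actual dataset.

```r
# Toy data; values and column names are hypothetical.
loans <- data.frame(
  credit.score.lower = c(600, 600, 640, 640, 700, 700, 780, 780),
  credit.score.upper = c(619, 619, 659, 659, 719, 719, 799, 799),
  prosper.rating     = c("HR", "HR", "HR", "D", "C", "HR", "AA", "AA")
)

# Midpoint of the reported credit score range
loans$credit.score.exp <-
  (loans$credit.score.lower + loans$credit.score.upper) / 2

# Share of HR rated loans at each credit score midpoint:
# the mean of a logical vector is the proportion of TRUE values
hr.ratio <- tapply(loans$prosper.rating == "HR",
                   loans$credit.score.exp, mean)
print(hr.ratio)
```

Taking the mean of a logical vector keeps the numerator and denominator in one step, so the ratio is always HR loans over all loans in the same bin.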

Plot 3.

Employment status - Distribution.

This is the distribution of the employment status of Prosper borrowers. Prosper promises that even retired people can apply for a loan. The top 3 groups appear to be: 1. Employed 2. Full-time 3. Self-employed, in that order.

Plot 4.

Employment status duration frequency.

Employment status duration looks to be positively skewed, with mean 96.07 > median 67.00.

Plot 5.

Credit Lines frequency.

Current credit lines also follow a fairly normal distribution. This peaks at 59.00. The ‘Employed’ category in general has more credit lines open.

Plot 6.

Loan Listing category frequency.

The top 3 reasons for borrowing are: 1. Debt consolidation 2. Other 3. Home improvement. But if we include listings which are not classified, they rank second. I wanted to explore the reason behind this.

Plot 7.

Loan status distribution.

Next, I was interested in the distribution of loan status. I found that most loans were ‘Current’, followed by ‘Completed’ and ‘Chargedoff’. Now that I know how the loans are distributed, how does the Prosper ‘Alpha’ rating play into this? Intuitively, I wanted to know what the alpha rating stands for in a marketplace.


Plot 8.

Here, I extracted the loans by their creation date and plotted borrowing activity by year. Once Prosper hit the market in 2005, it appears to have gained steady traction until activity started to fall in 2008 and plummeted to its lowest value in 2009. This makes sense given the SEC investigation, which halted Prosper’s activities until Prosper reopened with a new site (and a new Prosper rating system).

Plot 9.

## [1] "2" "1" "3"

This shows the unique months available for 2014 (Jan, Feb and Mar).

I found that the sudden plummet in 2014 is because the data is only available until March 2014. I am not interested in whether the mean activity for the first three months of 2014 was higher than in other years (in all likelihood, it was).

Now, the reason more D’s than HRs may have been charged off is not very straightforward. If everything were perfect with the rating system, this shouldn’t have happened. Therefore, I decided to take a look at defaulting (‘Chargedoff’) rates for any given year.

Plot 10.

This is the ratio of defaulted loans to the total number of loans for any given year. Initially, without the Prosper rating system in place, lenders might have invested in loans with a high estimated effective return without understanding the risk, which explains the initial rise in chargeoffs in 2007. As more people understood the risks towards the end of 2007, the chargeoff rate appears to have dropped (chargeoffs take at least 150 days, so I presume lenders learned a good deal about the Prosper environment within the first half of 2007).
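A per-year chargeoff ratio of this kind can be computed as in the following minimal sketch. The rows are invented for illustration, and `year` and `status` are assumed column names; the real analysis may have used dplyr instead of base R.

```r
# Toy data; values and column names are hypothetical.
loans <- data.frame(
  year   = c(2006, 2006, 2007, 2007, 2007, 2007, 2008, 2008),
  status = c("Completed", "Chargedoff", "Chargedoff", "Chargedoff",
             "Completed", "Chargedoff", "Completed", "Completed")
)

# Charged-off loans divided by all loans, separately for each year
chargeoff.rate <- tapply(loans$status == "Chargedoff", loans$year, mean)
print(chargeoff.rate)
```

Dividing by the yearly total (rather than plotting raw chargeoff counts) keeps years with very different listing volumes comparable.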

After the SEC investigation, Prosper, with its new site and systems, should have minimized the percentage of chargeoffs, but this does not happen until 2011. I have a hunch this is mainly because Prosper’s ‘ALPHA’ rating algorithm lacked experience at this point and predicted much like the 2007-2009 algorithm, not as one would want it to. Let me see how it has fared over the years.

Plot 11.

I was pretty close. The distributions of Prosper’s ‘ALPHA’ ratings in 2011 and 2013 are almost mirror images of each other. There are a lot more poor alpha scores (‘D’s, ‘E’s and ‘HR’s) in 2011, and the general distribution leans towards the lower end. In 2013, the better alpha scores seem to have gained good strength. This makes sense: in 2011, not only were the lenders new to this system of lending by judging through alpha scores, the available prospects themselves didn’t help a lender!

Plot 12.

Here I’m removing the ‘C’ rated scores. The distributions now look almost like mirror images of each other (in their ratings). I’ve now found the reason why ‘D’s are charged off more than ‘HR’s: during the early years, ‘D’s were simply available in greater numbers than any other loans. Given that ‘D’s are medium-risk, medium-reward loans, it is likely 60% of them were charged off, hence they outnumber the ‘HR’s (simply because more of them were available).

Plot 13.

(Using a bi-variate plot here to explain the findings immediately above.) Here we see both a larger number of D’s and more chargeoffs among D’s. But that keeps changing every year, as fewer loans are charged off and the distribution becomes more normal.

Since I was already this far, I wanted to know how the chargeoff rate was corrected or managed by Prosper. Surely, after the initial stages of 2011, the lenders might have gotten a good idea, but how are lenders able to decide which loans to invest in? That is, how is Prosper facilitating investment in better-quality loans, making the distribution more normal? Looking into the charged-off loans might help.

Also, the peak in 2013 is due to Prosper starting its individual retirement arrangement (IRA). This had a minimum amount of $10,000, drastically improving the marketplace structure.

Plot 14.

The chargeoffs have indeed gone down dramatically. I was curious how this was achieved. Has Prosper encouraged investing in safer loans, or does it penalise investing in riskier loans? In fact, in 2013 there are no chargeoffs from AA rated loans (numeric 7). This establishes how important an alpha score is when looking to lend.

Plot 15.

This is a plot of estimated returns for HR loans in 2011 and 2013. The spread of return percentages is much wider in 2011 than in 2013, and there are more returns above the median in 2011 too. For the same HR category, the return is very different between the two years (estimated median return of 11.47% in 2011 vs 8.92% in 2013). This new Prosper evaluation might have resulted in decreased chargeoffs. By reducing the estimated rate and bringing it closer to the neighbouring Prosper alpha categories, Prosper sets an incentive to invest in safer loans while still making a good amount of profit (effective return).

Also, I found that the estimated effective yields of better Prosper scores were smaller than those of the worse ones. More on that in the two-variable plots.

(Keeping two variable plot of returns vs year here)

Plot 16.

Debt to income ratio frequency.

Most people have a debt-to-income ratio in the 0.1 - 0.2 range. Mean 0.22 > median 0.21, so the distribution is almost normal.

Plot 17.

Income range distribution.

The top 3 income ranges are : 1. $25,000 - 49,999 2. $50,000 - 74,999 3. $100,000+

Plot 18.

State-wise distribution of borrower’s city (loan origin)

Most borrowers seem to use Prosper from CA, NY, IL and TX.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

So far in the univariate plots, I’ve explored the distributions of various variables and how they fare. I am now interested in:

  1. How various other variables affect Prosper alpha scores.

  2. What changes in trends for alpha scores mean for other variables.

I think exploring these would provide some valuable insights.

Did you create any new variables from existing variables in the dataset?

Yes. I created quite a few.

  1. Year : Extracting year from listing.creation.date (Datetime to date)

  2. credit.score.exp : Average of upper and lower credit score.

  3. closed.month : Month when the loan was closed. Relevant for chargedoff / Completed loans only. (Extracted)

  4. days.to.fund : loan.origination.date - listing.creation.date (Date from datetime)

  5. rate.bin : I used cut to create a categorical rate.bin variable from borrower.apr; this way the trends were much clearer.

  6. Status : Converted all past-due statuses (0-15 days, 15-30 days etc.) to ‘Past due’

  7. Listing category : Used the legend from prosper and converted from number to corresponding category.

  8. Inquiries.bin : Using cut on variable total.inquiries (In two-variable plots)

  9. credit.lines : Using cut on open credit lines. (In two-variable plots)
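A few of the derived variables listed above could be built roughly as follows. This is a sketch on two invented rows using base R; `loan.status`, the bin width of 0.1 and the exact status labels are assumptions made for the example, not taken from the analysis.

```r
# Toy data; values, loan.status and the status labels are hypothetical.
loans <- data.frame(
  listing.creation.date = c("2008-03-14 10:22:01", "2013-11-02 18:05:40"),
  loan.origination.date = c("2008-03-25 00:00:00", "2013-11-08 00:00:00"),
  borrower.apr          = c(0.31, 0.12),
  loan.status           = c("Past Due (16-30 days)", "Current"),
  stringsAsFactors      = FALSE
)

# Datetime strings to dates (the time of day is dropped)
listed     <- as.Date(loans$listing.creation.date)
originated <- as.Date(loans$loan.origination.date)

# Year : extracted from the listing creation date
loans$year <- as.integer(format(listed, "%Y"))

# days.to.fund : origination date minus listing date, in days
loans$days.to.fund <- as.integer(originated - listed)

# rate.bin : numeric APR cut into categorical bins (assumed width 0.1)
loans$rate.bin <- cut(loans$borrower.apr, breaks = seq(0, 0.5, by = 0.1))

# Status : collapse every past-due variant into a single level
loans$status <- ifelse(grepl("^Past Due", loans$loan.status),
                       "Past due", loans$loan.status)
print(loans[, c("year", "days.to.fund", "rate.bin", "status")])
```

Binning with `cut` turns a continuous rate into a factor, which is what makes bar charts filled by alpha score possible.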

I’ve also created multiple temporary dataframes, for which I follow this convention:

  1. x <- dataframe created for variable x
  2. x.groups <- I have used dplyr group_by on variable x.
  3. x.groups.y <- I have used dplyr group_by on x and subsetted variable y.

What is/are the main feature(s) of interest in your dataset?

My main feature of interest was the alpha score. The alpha scores are decided using many other parameters and hence showed a strong trend whenever I plotted them against any other determining parameter. This is where I realised that plotting alpha against any other parameter will pretty much always yield some result and will not help me discover any trends in the data. Therefore, I decided to see how each individual parameter affects the alpha score on its own.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes. Many of the plots wouldn’t have yielded the results they did if I had plotted the data straight up. For example, converting borrower rate from a numerical to a categorical variable showed how better alpha scores get better rates. Similarly, plotting credit scores bin-wise (again converting from numerical to categorical form) showed that higher bins contain only higher alpha scores, much more clearly than a scatter plot ever could. This kind of operation is done throughout the analysis, even in the two-variable plots.

Another aspect of this is removing outliers using the ‘quantile’ function, after which I could observe a much stronger correlation and hence better-looking plots.
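Quantile-based trimming of this kind can be sketched as below. The vector is made up, and the 5% / 95% cut-offs are the ones mentioned later for the credit score plot; other plots may have used different limits.

```r
# Toy values with one obvious outlier (hypothetical data)
x <- c(1:99, 1000)

# Lower and upper cut-offs at the 5% and 95% quantiles
lo <- quantile(x, 0.05)
hi <- quantile(x, 0.95)

# Keep only the observations between the two cut-offs
trimmed <- x[x >= lo & x <= hi]

print(range(x))        # includes the outlier
print(range(trimmed))  # outlier removed
```

Trimming both tails removes genuine observations as well as outliers, so correlations computed afterwards describe only the central part of the data.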

Also, various operations for converting a column from one form to another (numeric -> character and vice versa) are done to support the immediate need. As mentioned, I create temporary dataframes on a need-only basis so as not to modify the main data, following the convention described above.

Two variable plots

Observations :

Here, I intend to evaluate how well Prosper’s alpha score has performed and observe how the score changes as other variables change. Using this, I intend to establish relations between various variables and the ‘alpha’ scores, and use these variables in gauging the strength of a loan.

Observation 1.

Credit score distribution (By alpha).

I’m removing all the outliers (below the 5% and above the 95% quantile), except for ‘AA’ rated individuals. This is because most AA rated individuals had exceptionally high scores (above the 99th percentile), and removing values above the 95th quantile halved the number of ‘AA’ loans.

This plot gives one important insight: as credit score increases, the Prosper rating improves. Above a credit score of 774, only ‘AA’ rated loans are available. Also, the ratio of ‘HR’ rated loans to the total number of loans decreases as the credit score range increases. The not-so-good Prosper scores in general (HR, E, D, C) also decrease with an increase in credit score. On a closer look, each alpha score is itself roughly normally distributed.

Observation 2.

Until 2008 (including some parts of 2008), the listing reason was not recorded for some loans, hence the huge number of NAs. This explains the unclassified listings.

Observation 3.

Defaulting rate by prosper alpha score.

This is as expected. A Prosper ‘HR’ loan is always going to default (‘Chargedoff’) more than an ‘AA’ loan, which hardly seems to have defaulted at all. This indicates the effectiveness of Prosper’s ‘Alpha’ rating. But I see something amiss: ‘D’s seem to have been charged off more than ‘HR’s (albeit by very little!). Looking into Prosper’s activities may give some useful insight into this.

I decided to look into Prosper’s activities from the very beginning. This should clear up any questions I have about the data.

Observation 4.

In both years, full-time and part-time employees seem to have defaulted the least, comparatively. Having a good source of income and employment looks like a good parameter to take into account when computing the Prosper alpha rating.

Observation 5.

Borrowers’ rate for alpha score.

This is the distribution of borrowers’ APR by their alpha score. It is clear that the better alpha scores get lower APRs. This is because not only do they pay a smaller fee compared to the other categories, they also have lower interest rates (lenders accept lower yields because the risk is low).

Observation 6.

% of home-owners is higher for better alpha scores and looks to be receding for worse alpha scores.

Observation 7.

Evidently, the ratio of homeowners is higher for better Prosper alpha scores. HR rated loans have a higher % of homeowners than D and E rated loans, which I found was because most of those loans were rated HR due to the income source being unverified.

Observation 9.

Income verifiable by Prosper score. The % of verifiable income is higher for better alpha scores. Let me verify by calculating the % of loans with income unverified.

Observation 10.

AA rated loans on average have 3 in 100 loans without verified income, whereas nearly 17.5% of HR rated loans have income unverified. Clearly this is a major factor in determining the alpha score.


I primarily use the Observations sub-category under bi-variate plots because I think I can use these insights in the bi-variate plots. A few plots really piqued my interest, namely {Observation 1, Plot 2, Observation 3, Plot 11, Plot 12, Plot 14, Observation 5, Observation 10}, where I used the ‘fill’ parameter for alpha scores and these indicated a strong trend.

From my exploration , I find that -

  1. Employment status
  2. Income range
  3. Status of income Verification (Verified / Unverified)
  4. Homeowner (Y / N)
  5. Debt to income ratio
  6. Monthly loan payments and
  7. Stated monthly income

indicate a pretty strong trend across all the Prosper scores. Therefore, these may determine other aspects of the loan (like borrower rate, days to fund etc.), which I intend to explore in the two-variable plots. From these observations and some univariate plots, I observed that some variables show a strong trend (both negative and positive) with changes in Prosper alpha ratings. Therefore, in future plots, I think I can use these to compare how well a loan performs for a given alpha rating and also check how various parameters affect alpha ratings!

1. What are the effective yields of various alpha scores?

After analysing HR loans from 2011 and 2013, this was pretty straightforward. Prosper offers higher effective yields on higher-risk loans. But the estimated effective yield is very bad for HR loans (15.75 median loss rate), i.e. a large portion of the promised return (offered by the borrower) is lost to Prosper fees. The loss % is also very high, due to the nature of the risk. Conveniently, Prosper allows a minimum investment of $25 in a given loan. Thus, among HR loans, lenders can invest in those which offer smaller yields (compared to the ones which offer higher yields, which in the case of HR would only mean more risk). This is supported by the narrowing of the spread in the previous graphs (boxplot with spread of return % in 2011 vs 2013). This also helps lenders diversify their investments, making the distribution more normal.

The correlation between Prosper rating and effective yield is -0.96, which is very strong, meaning higher Prosper ratings give lower returns.

Also, the \(r^{2}\) value is 0.9339, meaning most of the variance in the estimated effective yield is explained by the Prosper alpha scores.
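For a single predictor, the \(r^{2}\) of a linear fit is simply the square of Pearson's r, which is why the two figures are always quoted together in this section. The sketch below checks this on invented numbers; the numeric rating scale (7 = AA down to 1 = HR) follows the convention mentioned earlier, but the yield values are made up.

```r
# Hypothetical numeric ratings (1 = HR ... 7 = AA) and yields
rating <- c(1, 2, 3, 4, 5, 6, 7)
yield  <- c(0.16, 0.15, 0.12, 0.11, 0.08, 0.07, 0.05)

# Pearson correlation between rating and yield
r <- cor(rating, yield)

# R-squared from a simple linear regression of yield on rating
fit <- lm(yield ~ rating)
r.squared <- summary(fit)$r.squared

print(c(r = r, r.squared = r.squared))
print(isTRUE(all.equal(r^2, r.squared)))  # the two agree
```

Since the sign is lost when squaring, the direction of the relationship has to be read from r itself.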

2. Loan original amount vs Debt to income ratio relation :

This is as expected. People who take larger loans tend to have a higher debt-to-income ratio.

But the two are not strongly correlated (r value of 0.1352 and \(r^{2}\) value of 0.01829).

3. Monthly loan payment vs Debt to income ratio :

This is also true to logic. The more debt you have, the more you pay back monthly for a given term, as is clear from the increasing medians. (I faceted by term because the overall debt vs monthly loan payment relation would be similar across different economic classes.)

This indicates a correlation of 0.19, which is not very strong. I also computed the \(r^{2}\) value term-wise, and the maximum value was 0.03671.

4. Credit Score plots

4.1 How does credit score affect borrower rate?

This is also as expected. The Prosper alpha rating takes credit score into account, thus lower credit scores (and hence poorer alpha scores) get higher interest rates.

I temporarily removed the higher credit scores from the data, since scores above 774 seemed to get similar rates. The data has a correlation of -0.35 and an \(r^{2}\) of 0.1253. The correlation is only moderate, so this is a case where the summary statistics do not fully capture a pattern that is obvious in the plot!

5. Credit Lines

5.1 Distribution of credit lines by Prosper alpha score.

The rating also looks to be independent of the number of credit lines one has; my guess was that the rating would worsen with increasing credit lines. The credit lines themselves are normally distributed.

5.2 How does open credit lines affect borrower rate?

This one was pretty surprising to me. I expected that if a borrower had more credit lines, the risk of defaulting would be reassessed and the rate would be increased. But this was not true. I found that this is because people with more credit lines are usually ones who are doing well for themselves (higher salary, homeowner, verifiable source of income etc.). I don’t expect any sort of correlation between the two.

5.3 Borrower rate vs Total / Open Credit lines

As in my previous observations, regardless of the credit lines open, one of the key determining factors for borrower rate turns out to be the Prosper alpha score. Even if some credit lines are not currently open (not paid for / borrowed from), it does not seem to affect the borrower rate.

My guess was that higher values of total/open credit lines would mean a higher delinquency / default rate and would therefore be assigned a poorer Prosper score. But this is not given much consideration.

5.4 Open credit lines vs Delinquencies

Then I decided to look into the delinquencies one might have had due to the number of credit lines open. I expected that someone with more credit lines open would be more likely to fall short on monthly payments. However, this was also not true. I found that this is because credit lines are given to borrowers based on their credit scores (assessed on various parameters by Prosper), hence the % of default in every category is about equally likely.

Pearson’s r is -0.16 and \(r^{2}\) is 0.02621, which supports empirically what I had guessed above.

6. Inquiries made

6.1 Inquiries made frequency (by alpha score)

This was one of the more insightful plots. I found that the number of inquiries into a credit score is a very important indicator of the Prosper score. An important thing to notice here is that the ratio of better Prosper alpha scores decreases and that of worse Prosper alpha scores increases as the number of inquiries increases. This actually makes sense. If a person is desperate for a loan, they will approach multiple sources, each of which will look into the credit score and prompt an inquiry. Also, for loans with no previous inquiry (first loan attempt ever), the number of ‘AA’ ratings is at its maximum, which only reinforces my hunch. (Refer to the NA bin!)

This clearly shows the decreasing number of AA (better) rated loans with an increasing number of inquiries. The NA bin represents ‘0’ inquiries made previously, and there we see the maximum ratio of better Prosper scores to total Prosper scores.

6.2 Inquiries made for HR loans.

This shows the increasing % of HR rated loans relative to the total number of loans as inquiries increase.

Inquiries are a useful factor in determining the Prosper alpha rating, although the linear correlation is weak: -0.16 (more inquiries -> poorer alpha score), with an \(r^{2}\) value of 0.02853.

6.3 % of chargedoff loans with inquiries.

The % of defaulting loans also increases with the number of inquiries (as is evident from the plot).

6.4 Inquiries made vs borrower rate (Bin wise)

Borrower rates increase steadily, albeit by very little, as the number of inquiries increases.

Inquiries has an r value of 0.141 and an \(r^{2}\) value of 0.02009.

7. Estimated loss by Prosper alpha scores

This is the % of money lost by reinvesting in a similar type of loan over a period of one year. Clearly, low-risk loans have smaller losses than high-risk loans.

This also has a correlation coefficient of 0.96 (positive, the exact opposite of effective yield vs alpha scores). The \(r^{2}\) value is 0.9296, which is extremely high.

8. Monthly loan payments vs Current Credit Lines

As people have more credit lines, they pay more principal to Prosper, as expected, because of higher bank card utilization!

The correlation is positive at 0.18, while the \(r^{2}\) value is 0.03493, which is very low, meaning current credit lines don’t explain the monthly loan payment very well.

9. Bank card utilization.

Bank card utilization is the ratio of the amount pending before the current cycle to the total credit available. It is clear from this plot that lower-risk loans have lower bank card utilization. This is helped by the fact that these borrowers are less likely to be delinquent and also have more credit lines on average (as discussed above).

Also, a small portion of HR rated loans appeared to have better bank card utilization rates than D and E rated loans, and on analysing the ‘income verifiable’ parameter, it was clear there were fewer people with verified income in the D and E categories.

This has an r value of 0.26, while the \(r^{2}\) value is very low at 0.07033.

10. Defaulting rate vs Income

From this plot, it is clear that people with lower income default more on average than people with higher income. I think I’ll look into this more in the multi-variate plots, adding monthly loan payment as a third parameter.

For defaulting loans, this has a strong correlation of -0.944 and an \(r^{2}\) value of 0.8811.

11. Income range vs is.borrower.homeowner.

People with higher income are more likely to be homeowners, as expected. This is the reason homeownership is used as a factor in determining the Prosper alpha score. (Note that above an income of $50,000, there are more homeowners than not!)

After that, I wanted to see if loans with a high debt-to-income ratio come with home ownership. But surprisingly, the % of home ownership was roughly the same across the various debt-to-income ratios.

Between income and homeownership, the correlation is 0.2584, which is modest, and the \(r^{2}\) value is 0.06678. This means the variance cannot be explained solely by debt to income ratio; other factors (such as income range and credit lines open) need to be considered.

12. Debt to income ratio vs Prosper alpha scores.

However, people with a high debt-to-income ratio are less likely to get a good alpha score than people with a low one.

13. Homeownership by Debt to income ratio.

There are no strong trends here. All the bins have a good proportion of homeowners, indicating that higher debt-to-income ratios might come from people who have a good source of income but have borrowed more!

14. Public record :

From this plot, I observe that loans with a public record have on average a 4-5% higher borrower rate than loans with no public record. However, as the number of public records increases, the borrower rate remains unaffected.

This is the proportion of ‘D’ rated loans for a given public record bin. As a general trend, I notice that the % of poorer rated loans increases with the number of public records (the chance of defaulting is higher).

This has an r value of -0.48 and an \(r^{2}\) value of 0.06145. This is because, while loans without a public record have a median borrowing rate of 18%, loans with public records have rates of 22%-24%, which is not much of an increase even for a much larger number of public records.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

These were the strong trends I saw :

  1. The borrowing rate increases steadily as credit score drops. In fact, for every 15-point drop in average credit score, there is a noticeable rise in borrower rate. (Correlation of -0.35 as well!)

  2. It is possible to predict to some extent whether a person might become delinquent from the number of credit lines open. (r = -0.16; this value, however, is significantly pulled down by people with fewer credit lines open, who make up the majority.)

  3. The number of inquiries made by a borrower is an important factor in determining the health of the loan. It is also one of the most important factors in determining the Prosper alpha score (a desperate borrower has more inquiries). Also, one can expect the borrowing rate to increase steadily with the number of inquiries, and the loan is also expected to default more often.

  4. Taking into consideration only defaulting loans, one of the key indicators of whether a loan is likely to default is the borrower’s monthly income. Plotting this bin-wise and segregating only defaulting loans yielded a correlation of -0.944, i.e. users with higher income are likely to default less and vice versa.

  5. Debt to income ratio could, to some extent, tell whether a borrower is a homeowner. However, the variance in the data is not entirely explained by this. Other key factors for determining the alpha scores should also be considered.

  6. A loan with a public record (bankruptcy etc.) is very likely to get a higher rate of interest than one without (correlation of -0.48). However, any further increase in the number of such public records has little effect.

However, some of the plots were pretty surprising in what they revealed:

  1. I expected borrower rate to increase with credit lines open; however, this was not the case at all. In fact, borrower rate looks to be almost independent of credit lines open.

  2. Also, homeownership is not exactly determined by debt-to-income ratio.

What was the strongest relationship you found?

Estimated loss by Prosper alpha score and estimated yield by alpha score have correlation coefficients of 0.96 and -0.96 respectively.

Also, given a monthly income, it is easier to say what % of such loans might default, based on a -0.94 correlation coefficient.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Multi-variate

Loan status by borrower-APR :

Borrower A.P.R vs Monthly loan payment by count :

Most of the defaulted loans have very small monthly loan payments (around 80% have a monthly loan payment of less than $600). Also, borrower APR is between 10% and 40%; therefore, small loans with high APR are more likely to default. It is very possible people take smaller loans to satisfy an immediate need or want, or because this is all they could afford. Either way, this is a big red flag.

Borrower A.P.R vs Monthly loan payment by Debt to income ratio :

This plot was pretty insightful. For each monthly payment level, both completed and charged-off loans have similar borrower rates. The determining factor, however, was debt-to-income ratio. Completed loans tend to have debt-to-income ratios in the range 0.1-0.2, while the greener points (0.3-0.4) are mostly among the defaulted loans.

Days to finance loan :

On average, most loans are funded within 20 days of their listing. AA rated loans in most cases (after 2010) are funded within the first 10 days. Surprisingly, the median for AA rated loans is 10 days, while that of B and C rated loans is 7 days! I decided to see why.

From the figure, one important revelation is that loans which give 10%-30% yields are funded much faster than loans with <10% estimated effective yield. This is understandable, since even lenders who look to ‘diversify’ their lending, as discussed earlier, will want to maximise their profit. These are also the loans with a moderate amount of risk, so lenders can take a gamble on them. Also notable is how high-risk loans (30%-40% effective yield) take nearly 30 days to fund (if they are funded at all!).

Homeowner determination :

People who have a low stated monthly income and a high debt-to-income ratio tend not to own a home. This can be used as a key factor in determining homeownership, and hence the Prosper score, since the trends are very strong here.

Estimated effective yields for Alpha scores by year :

This is as I had observed earlier. Prior to 2011, the Prosper alpha score wasn’t really indicative of the risk of a loan: the effective yield shows no steady pattern of returns. After 2011, however, the alpha scores have corresponding ‘levels’ of effective yield, as observed in the two-variable plots.

Borrower rate vs Loan original amount : By credit score.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I noticed a strong relation between :

  1. Monthly loan payment vs Stated monthly income

  2. Monthly loan payment vs Stated monthly income and how they caused defaulting!

  3. Borrower APR vs Monthly loan payment to determine loan status (completed or defaulted!)

  4. Borrower APR vs Monthly loan payment by Debt to income ratio, which was the most intuitive graph I feel!

  5. Days to finance by Prosper rating alpha.

  6. Monthly loan payment vs Stated monthly income to determine defaulting using borrower rate!

  7. Determining homeowner status using Monthly loan payment vs Stated monthly income and debt-to-income ratio.

Were there any interesting or surprising interactions between features?

Yes, there were a couple of them. From my two-variable plots, I concluded that income verifiability is one of the major factors in determining the Prosper alpha score. Therefore, I expected more defaulted loans to have income.verifiable set to False. However, the trend was just not observable in the graph.

Also, I expected borrower rates to be significantly higher for defaulted loans, since the Prosper alpha scores of those loans might have been

I noticed that loans with a higher Prosper score actually took longer to get financed, because the reward is poor compared to B or C category loans!

Final Plots and Summary

Plot One

This shows the importance of the Prosper ‘alpha’ scores. If I had to pick one parameter indicative of the health/risk of a loan, Prosper alpha scores would be it. Although before 2008 the alpha scores (or numeric scores) didn’t really matter all that much, after 2008, and especially after 2011, the Prosper alpha rating looks to consistently provide good information about the risk of the loan (or lack thereof!).

This plot shows how loans with a lower stated monthly income and a higher loan payment are the high-risk ones (HR and E), while loans with a higher stated monthly income are usually low-risk loans!

Plot Two

From this plot, I think one can figure out whether a loan is likely to default by looking at APR vs monthly loan payment. Most of the loans with a high APR and a low monthly loan payment look to have defaulted, especially if they have a high debt-to-income ratio. This is probably because a major part of the borrower’s income is consumed in repaying the debt!

Plot Three

This is a plot of borrower rate vs loan original amount by credit score (average). I noticed that, for the same amount borrowed, people with lower credit scores pay a higher borrower rate than people with better credit scores. This difference is very apparent: for example, in the year 2008, a person with a credit score of 700 would get a $5000 loan for less than 10% interest, but a person with an average credit score of 600 would get the same loan at nearly 35% interest. However, the difference is much better normalized after 2011, when the Prosper alpha scores represent the status of a loan better than they did previously!
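The comparison in this plot can be sketched as follows: compute the average credit score as the midpoint of the lower and upper bounds (the same idea as the credit.score.exp variable created earlier), then compare the mean borrower rate per score for one fixed loan amount. The records and the function names below are hypothetical and chosen only to mirror the 2008 example in the text; they are not taken from the dataset.

```python
from statistics import mean

# Hypothetical records: (score lower bound, score upper bound, borrower rate, loan amount).
# The values are invented to echo the 2008 example; they are not Prosper data.
records = [
    (600, 619, 0.35, 5000),
    (600, 619, 0.33, 5000),
    (700, 719, 0.09, 5000),
    (700, 719, 0.08, 5000),
    (700, 719, 0.12, 15000),
]

def average_credit_score(lower, upper):
    """Midpoint of the reported credit score range."""
    return (lower + upper) / 2

def mean_rate_by_score(rows, amount):
    """Mean borrower rate per average credit score, restricted to one loan amount."""
    rates_by_score = {}
    for lower, upper, rate, loan_amount in rows:
        if loan_amount == amount:
            score = average_credit_score(lower, upper)
            rates_by_score.setdefault(score, []).append(rate)
    return {score: mean(rates) for score, rates in rates_by_score.items()}

rates = mean_rate_by_score(records, 5000)
```

Holding the loan amount fixed is what makes the rates comparable across credit scores; without that filter, larger loans at different rates would blur the gap.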

I chose these three multivariate plots because they give a good amount of information about:

  1. The strength of the Prosper alpha score.

  2. How a loan might default (or whether it might).

  3. What one can expect from a loan, based on the borrower’s credit score.


Reflection

A major portion of the EDA on Prosper data, I feel, was spent on understanding the variables and how they affected the data. Once I was done with this process, the next step was to understand how two (or more) variables are correlated, or whether they might be correlated. Sometimes these correlations were obvious, like the ones involving the ‘alpha’ scores, since they were indicative of a loan and its status. Sometimes I had to dig deeper: clean, categorize, bin and remove some data. There were also places where I found interesting facts, places I would never have bothered to look otherwise.

The major challenge in this EDA was the size of the dataset, particularly its 81 variables. There is a huge possibility of each variable affecting every other variable in some way, and it was important that I discover the important ones and also the ones where the extent of this ‘effect’ is significant. Understanding how the variables represent real-world data, and how the system works, was also an important task, since without this ‘domain’ knowledge I wouldn’t have been able to compare and identify the important variables.

I feel I have uncovered most of the important correlations possible (but not all of them, of course!). Most importantly, I feel I have explored good insights about the loans, borrowers and lenders and how they have fared over the years, using distributions. I feel I have also removed data where needed (temporarily) and cleaned it accordingly to give suitable plots.

But this EDA can be improved in a few ways:

  1. In the two-variable plots, I uncovered a few key relations between alpha scores and other variables (for example income verifiable, monthly loan payment, etc.), and some of them showed strong correlation as well. I planned to decide which loans to ‘choose’ (from a lender’s perspective) even for given Prosper alpha scores, and I was mostly successful at it. However, all I could do was one-on-one comparison between variables, and I could not get a sense of how all the variables could influence a loan outcome together, i.e. a linear model (since this required a huge amount of RAM and I couldn’t complete the process at all).

  2. I’ve selectively chosen plots and variables which I feel are important, after plotting all the variables and discarding the ones which were not of much interest. It is entirely possible I might have missed some key insights. But I feel I have done a pretty decent job here.
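On the first point, one possible workaround for the memory limit (an assumption on my part, not something tried in this analysis) is to fit the model on a random subsample of rows instead of the full dataset. The sketch below uses synthetic data and a single predictor purely for illustration; the helper names are hypothetical, and a real model would use several variables and a proper statistics library.

```python
import random

def sample_rows(rows, k, seed=0):
    """Draw up to k rows at random, with a fixed seed so the result is repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept with one predictor."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic data lying exactly on y = 2x + 1, standing in for a large table.
data = [(x, 2 * x + 1) for x in range(1000)]

subset = sample_rows(data, 50)       # fit on 50 rows instead of all 1000
slope, intercept = fit_line(subset)
```

With real, noisy data the estimates from a subsample will differ somewhat from the full-data fit, so it would be worth repeating the fit with a few different seeds to see how stable the coefficients are.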